
Update backend.py #105

Closed
wants to merge 13 commits into from

Conversation

mavaylon1
Contributor

@mavaylon1 mavaylon1 commented Jul 12, 2023

  • Update I/O for Zarr to support the linkage of ExternalResources (bringing this in line with hdmf)
  • Remove python 3.7 support

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@codecov-commenter

codecov-commenter commented Jul 12, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.19% ⚠️

Comparison is base (13b92e8) 81.97% compared to head (a9689be) 81.78%.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #105      +/-   ##
==========================================
- Coverage   81.97%   81.78%   -0.19%     
==========================================
  Files          11       11              
  Lines        2663     2663              
==========================================
- Hits         2183     2178       -5     
- Misses        480      485       +5     
Impacted Files Coverage Δ
src/hdmf_zarr/backend.py 87.87% <100.00%> (ø)

... and 1 file with indirect coverage changes


@mavaylon1 mavaylon1 changed the title Update backend.py Update backend.py and remove python 3.7 Jul 12, 2023
@mavaylon1
Contributor Author

Fix #106

@mavaylon1
Contributor Author

mavaylon1 commented Jul 12, 2023

@oruebel After the updates, the gallery is failing with

Exception: could not resolve dtype for VectorIndex 'spike_times_index'

Should we make the fix for that separate, on its own ticket? I also believe this is linked to the external links workflow failing.
Other than that, it seems bringing hdmf-zarr up to hdmf 3.7 works.

@oruebel
Contributor

oruebel commented Jul 12, 2023

> Exception: could not resolve dtype for VectorIndex 'spike_times_index'

That looks like another error due to hdmf-dev/hdmf#848 triggering the rebuild of Containers such that we have to determine the data type rather than remembering the existing type. I suspect that this will most likely require a fix in the ObjectMapper.convert_dtype function in HDMF and a corresponding bug fix release for HDMF. I.e., I would suggest we do that first instead of releasing an update here with a known bug.

@mavaylon1
Contributor Author

> Exception: could not resolve dtype for VectorIndex 'spike_times_index'
>
> That looks like another error due to hdmf-dev/hdmf#848 triggering the rebuild of Containers such that we have to determine the data type rather than remembering the existing type. I suspect that this will most likely require a fix in the ObjectMapper.convert_dtype function in HDMF and a corresponding bug fix release for HDMF. I.e., I would suggest we do that first instead of releasing an update here with a known bug.

Right, that makes sense, but I was suggesting we merge this and handle it separately before a release. @oruebel

@oruebel
Contributor

oruebel commented Jul 12, 2023

> I was suggesting we merge this and handle it separately before a release

Let me check if I can figure this one out tomorrow so we can at least get an estimate of how hard it will be to fix this bug.

@mavaylon1
Contributor Author

> I was suggesting we merge this and handle it separately before a release
>
> Let me check if I can figure this one out tomorrow so we can at least get an estimate of how hard it will be to fix this bug.

@oruebel thoughts?

requirements.txt (outdated review thread, resolved)
@oruebel
Contributor

oruebel commented Jul 18, 2023

When running plot_convert_nwb_hdf5.py from the gallery with the test_gallery.py script I get a different error than when running the script from the command line. I'm wondering whether test_gallery.py is picking up an older version of HDMF or whether something has changed on the dev branch of HDMF.

@mavaylon1 mavaylon1 mentioned this pull request Jul 21, 2023
@mavaylon1
Contributor Author

mavaylon1 commented Jul 25, 2023

  1. python plot_convert_nwb_hdf5.py (with cache cleared)
    Calls export (which goes through a chain of super() calls to the inherited export) > the export in io.py calls "manager.build" > which goes to type_map build > which goes to obj_mapper build.

Now, upon initial inspection of the traceback:

  File "/Users/mavaylon/Research/NWB/hdmf2/hdmf/src/hdmf/backends/hdf5/h5tools.py", line 1468, in __list_fill__
    raise e
  File "/Users/mavaylon/Research/NWB/hdmf2/hdmf/src/hdmf/backends/hdf5/h5tools.py", line 1465, in __list_fill__
    dset[:] = data
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/Users/mavaylon/opt/anaconda3/envs/NWB2/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 1009, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 280, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 145, in h5py._proxy.dset_rw
  File "h5py/_conv.pyx", line 444, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 95, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 249, in h5py._conv.conv_str2vlen
TypeError: Can't implicitly convert non-string objects to strings

When you add a breakpoint at line 1467 of h5tools.py, you are confronted with an np.array of group builders. These builders are for the electrode group ADunit_32. The fact that it's an np.array instead of a dataset of references is not correct (per Ryan's advice).

Update/Edit: Looking at the same set of builders, it is an np.array of Builders either way, so I don't think the format is wrong. You can verify this with the original breakpoint at line 1467 of h5tools, and then by removing the cache and setting a breakpoint at line 1209 of h5tools and looking at 'data'. Both are np.arrays of Builders. What's even stranger is that the _dtype for dset is "dtype('O')" (step with n a few times in the debugger), which to me is weird since that's exactly what we're trying to avoid (hence Oliver's change).

Now, line 1467 is within "list_fill", which is called by "write_dataset" (still within h5tools). The part of write_dataset you'd think we'd be in is line 1265, which writes an array of object references, BUT we are at line 1301. The options['dtype'] returned is utf8, which reminds me of Oliver's addition of:

elif np.issubdtype(value.dtype, np.dtype('O')):
    # Only variable-length strings should ever appear as generic objects.
    # Everything else should have a well-defined type
    ret_dtype = 'utf8'

This options['dtype'] shouldn't be utf8 from my understanding.

Now, what if we remove the clear cache? What would options['dtype'] be? It is object. (Add a breakpoint at line 1208 and see.)

So they are different and we need to figure out what is setting the dtype.
Going back to Oliver's change, why was it added in the first place?
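
The mismatch described above can be reproduced with a small numpy sketch. FakeBuilder and guess_dtype below are hypothetical stand-ins for the real Builder objects and the convert_dtype branch quoted above, not HDMF code; any non-string Python object produces the same dtype('O') signature:

```python
import numpy as np

class FakeBuilder:
    """Hypothetical stand-in for an HDMF Builder (any non-string object works)."""
    pass

def guess_dtype(value):
    # Sketch of the heuristic under discussion: any object-dtype array
    # is assumed to hold variable-length strings.
    if np.issubdtype(value.dtype, np.dtype('O')):
        return 'utf8'
    return str(value.dtype)

strings = np.array(['a', 'b'], dtype=object)
builders = np.array([FakeBuilder(), FakeBuilder()], dtype=object)

# Both arrays carry the same dtype('O'), so the heuristic cannot tell
# strings from references:
print(strings.dtype, builders.dtype)  # object object
print(guess_dtype(strings))           # utf8 (correct)
print(guess_dtype(builders))          # utf8 (wrong: these are references)
```

This is why the cache-cleared path ends up in the utf8 branch even for arrays of builders: numpy's dtype alone cannot distinguish the two cases.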

@mavaylon1
Contributor Author

mavaylon1 commented Jul 25, 2023

  1. Exploring where 'dtype' is set: it seems to be set on line 744 of obj_mapper, which is in build.
    This connects back to the traceback flow, which exits by calling build from obj_mapper. However, we don't seem to even enter the conditional that sets dtype='object'.

Update:
It goes through line 747 of obj_mapper, where it calls convert_dtype, which is where Oliver's code sets it to utf8 (with cache).

@mavaylon1
Contributor Author

mavaylon1 commented Jul 26, 2023

  1. I ran:
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

ch = logging.FileHandler('test.log', mode='w')
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

I compared the two log files (with and without the cache) and there are differences (one makes a new root while the non-cache run reuses it). I will take a closer look tomorrow, but any insight would be great. @oruebel
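
The two runs' logs can be compared with the stdlib difflib module. The log lines and file names below are illustrative placeholders, not real HDMF output; in practice you would read the two files produced by the handler above with open(...).readlines():

```python
import difflib

# In-memory stand-ins for the two log files (cache cleared vs. cache kept).
with_cache = ["TypeMap - DEBUG - building new root\n"]
without_cache = ["TypeMap - DEBUG - reusing existing root\n"]

# unified_diff yields a line-by-line diff in the familiar -/+ format.
diff = difflib.unified_diff(
    with_cache, without_cache,
    fromfile='with_cache.log', tofile='without_cache.log',
)
print(''.join(diff))
```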

@oruebel
Contributor

oruebel commented Jul 26, 2023

> Going back to Oliver's change, why was it added in the first place?

On export, when the cache is being cleared, it means that HDMF is now constructing a new set of builders. On the "original" builders that were read from file (e.g., HDF5) the dtype is already set, so when we export the builders we don't need to determine the dtype. However, when the cache is cleared and the builders are rebuilt, we now need to determine the dtype from the array (e.g., the h5py.Dataset). I.e., when we clear the cache the code takes a different path because it now needs to determine the dtypes to use for all the datasets, rather than being able to use the original dtypes. @rly why is the cache being cleared again for export?

> So they are different and we need to figure out what is setting the dtype.

Correct. I believe the issue is that since export was changed to clear the cache, we are now hitting corner cases in the logic to determine dtypes in the ObjectMapper that we did not encounter before. If I understand correctly, you are seeing that the dtype is set to the wrong type by the code I added to handle the case where we have an object-type numpy array. It seems like we either: 1) need some logic here to distinguish between a numpy array of references and a numpy array of strings, or 2) need to figure out how to keep information about the original dtypes around so that we don't have to guess the dtype (e.g., either by removing the cache clearing or by having an option to rebuild builders while keeping dtype information from the previous builders).
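
Option (1) could look something like the following sketch. classify_object_array and FakeBuilder are hypothetical, not HDMF API; the idea is simply to peek at the array's elements instead of trusting dtype('O') alone:

```python
import numpy as np

class FakeBuilder:
    """Hypothetical stand-in for an HDMF Builder/reference."""
    pass

def classify_object_array(value, sample=10):
    # Inspect up to `sample` elements of an object-dtype array to decide
    # between variable-length strings and references.
    flat = value.ravel()[:sample]
    if all(isinstance(x, (str, bytes)) for x in flat):
        return 'utf8'              # variable-length strings
    return 'object_reference'      # likely references to other builders

print(classify_object_array(np.array(['a', 'b'], dtype=object)))       # utf8
print(classify_object_array(np.array([FakeBuilder()], dtype=object)))  # object_reference
```

Sampling a few elements keeps the check cheap for large arrays, at the cost of misclassifying arrays that mix strings and references past the sampled prefix.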

@mavaylon1 mavaylon1 mentioned this pull request Aug 19, 2023
@rly
Contributor

rly commented Oct 1, 2023

@mavaylon1 Now that removing Python 3.7 testing was done in a separate PR, is this PR still necessary? I merged dev into this branch and the changes no longer look relevant.

@rly rly changed the title Update backend.py and remove python 3.7 Update backend.py Oct 1, 2023
@mavaylon1 mavaylon1 closed this Nov 5, 2023
@mavaylon1 mavaylon1 deleted the io_er branch April 11, 2024 15:52